feat: expand eval dataset with edge and complex cases and refine prompts #458
Open
cocosheng-g wants to merge 33 commits into main from
Conversation
- Implement isolated `TestRig` for environment-safe, concurrent evaluations.
- Add gold-standard datasets for Issue Triage, Scheduled Triage, Assistant, and Issue Fixer.
- Implement Mock MCP Server for high-fidelity PR Review benchmarking.
- Add nightly evaluation workflow with multi-model strategy matrix.
- Automate aggregate reporting for GitHub Job Summaries.

Next Steps:
- Expand evaluation datasets with more edge cases.
- Fine-tune workflow prompts based on baseline quality analysis.

Refs: #219
- Added 30+ cases (edge, complex, real-life) across gemini-triage, gemini-issue-fixer, and gemini-review.
- Refined the triage prompt to handle spam, ambiguity, and vague reports more robustly.
- Added a validation step to the issue-fixer prompt to handle impossible or out-of-scope requests.
- Updated the mock MCP server to support new evaluation scenarios, including race conditions and architectural violations.
- Improved evaluation scripts for better tool-call detection across namespaces.
- Verified that all evaluations pass with the updated prompts.
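The mock MCP server's role can be sketched as a registry of deterministic tool handlers; this is a hypothetical illustration of the shape of the idea, not the repository's actual `mock-mcp-server.ts` or the real MCP SDK:

```typescript
// Hypothetical sketch: each eval scenario gets canned, deterministic
// responses for the tools the agent under test is expected to call.
type ToolHandler = (args: Record<string, unknown>) => string;

const tools = new Map<string, ToolHandler>([
  ["search_code", (args) => JSON.stringify({ matches: [], query: args["q"] })],
  ["get_file_contents", () => "// canned file contents for the scenario"],
]);

function callTool(name: string, args: Record<string, unknown>): string {
  const handler = tools.get(name);
  if (!handler) throw new Error(`Unknown tool: ${name}`);
  return handler(args);
}
```

Deterministic responses keep benchmark runs comparable across models, since no live GitHub state can drift between runs.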
Contributor

🤖 Hi @cocosheng-g, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
- Update triage guidelines for stricter handling of spam and ambiguity.
- Refine the fixer validation step to use explicit keywords for out-of-scope cases.
- Improve evaluation pass rates for edge cases.
Resolved conflicts:
- package.json: Use 'vitest' directly in the test script (from main).
- .github/workflows/evals-nightly.yml: Use the 'Install Gemini CLI' step and 'always()' condition (from main).
- evals/data/*.json: Keep expanded datasets (from HEAD).
- evals/pr-review.eval.ts: Keep updated test logic (from HEAD).
- evals/mock-mcp-server.ts: Manually merged new mock data and tool handlers.
- Run tests sequentially to reduce flakiness and avoid API rate limits.
- Enable the mock GitHub MCP server for the issue-fixer evaluation to match prompt instructions.
- Proactively create the 'chats' directory in the test rig to prevent 'ENOENT' errors during chat recording.
- Refine structural checks to handle out-of-scope/impossible requests and account for alternative git/issue tool usage.
- Update expected plan keywords in evaluation datasets.
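The 'chats' directory fix might look like this (a minimal sketch; the function name `ensureChatsDir` and the test-rig layout are assumptions, not the repository's actual code):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical sketch: ensure the chat-recording directory exists before
// the CLI under test writes into it, so recording never hits ENOENT.
function ensureChatsDir(testDir: string): string {
  const chatsDir = path.join(testDir, "chats");
  // recursive: true makes this a no-op if the directory already exists.
  fs.mkdirSync(chatsDir, { recursive: true });
  return chatsDir;
}
```

Creating the directory up front is cheaper and less flaky than retrying the write after catching ENOENT.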
…vals" This reverts commit 1bb9df0.
… standard runners
… improve pass rate
- Broaden the hasExploration check in issue-fixer.eval.ts to include MCP/extension tools.
- Add search_code and get_file_contents to mock-mcp-server.ts.
- Add a 2s delay before reading telemetry logs across all evals to prevent race conditions in CI.
- Fixes failures observed with gemini-3-pro-preview in CI.
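A minimal sketch of the 2-second settle delay (the helper names here are assumptions; the evals' actual code may structure this differently):

```typescript
// Hypothetical sketch: give the telemetry exporter time to flush its log
// file before the eval reads it, avoiding a read-before-write race in CI.
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function readTelemetryLog(readFile: () => string): Promise<string> {
  await sleep(2000); // 2s settle delay before touching the log
  return readFile();
}
```

A fixed delay is a blunt instrument; polling for the log file with a timeout would be more robust, but a short sleep is often enough to stabilize CI.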
- Increase testTimeout to 15m to handle complex cross-file refactor tasks.
- Add 'search' to tool exploration keywords for broader detection.
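In a Vitest setup, the timeout bump might look like this (a sketch only; the repository's actual config file and its other options are not shown here):

```typescript
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    // 15 minutes in ms: complex cross-file refactor evals can run long.
    testTimeout: 15 * 60 * 1000,
  },
});
```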
Signed-off-by: Coco Sheng <cocosheng@google.com>
This PR continues the work on issue #219 by expanding the evaluation datasets and refining the workflow prompts.
📊 Evaluation Results (Post-Tuning)
Changes:
- Expanded Evaluation Datasets: Added 30+ edge, complex, and real-life cases across triage, fixer, and pr-review.
- Prompt Refinements: Hardened the triage prompt against spam and ambiguity, and added a validation step to the issue-fixer prompt for impossible or out-of-scope requests.
- Verification: All evaluations have been verified to pass.